Text Processing

In computing, the term ''text processing'' refers to the theory and practice of automating the creation or manipulation of electronic text. ''Text'' usually refers to all the alphanumeric characters specified on the keyboard of the person engaging in the practice, but in general ''text'' means the abstraction layer immediately above the standard character encoding of the target text. The term ''processing'' refers to automated (or mechanized) processing, as opposed to the same manipulation done manually.

Text processing involves computer commands which invoke content, content changes, and cursor movement, for example to
* search and replace
* format
* generate a processed report of the content of a text file, or
* filter a file or report of a text file.

The text processing of a regular expression is a virtual editing machine, having a primitive programming language that has named registers (identifiers), and named positions in the sequence of characters comprising the text. Using these, the "text processor" can, for example, mark a region of text, and then move it. The text processing of a ''utility'' is a filter program, or ''filter''. These two mechanisms comprise text processing.
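
The two mechanisms named above, a regular-expression edit and a filter, can be illustrated with a minimal Python sketch. The function names `search_and_replace` and `filter_lines` and the sample text are hypothetical and chosen only for illustration; the sketch assumes Python's standard `re` module and is not meant as a definitive implementation.

```python
import re


def search_and_replace(text: str, pattern: str, replacement: str) -> str:
    """Replace every match of a regular expression in the text."""
    return re.sub(pattern, replacement, text)


def filter_lines(text: str, pattern: str) -> list[str]:
    """Keep only the lines containing a match, much as a grep-like filter would."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]


# Hypothetical sample input, invented for this example.
sample = "error: disk full\ninfo: backup done\nerror: disk offline\n"

replaced = search_and_replace(sample, r"\bdisk\b", "volume")
errors = filter_lines(sample, r"^error:")

print(replaced)
print(errors)
```

The first call rewrites the text in place of a manual edit, while the second passes only the matching lines through, which is the behaviour the article attributes to a filter.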


Definition

Because standardized markup such as ANSI escape codes is generally invisible to the editor, it comprises a set of transitory properties that at times become indistinguishable from word processing. But the definite distinctions from word processing are that text processing proper:
* represents "text processing utilities", not just "text editing" applications.
* is much more "the keyboard way", as opposed to "the mouse way" (e.g. drag and drop, cut and paste) of initiating an edit.
* is sequential access rather than random access in approach.
* operates directly at the presentation layer rather than indirectly at the application layer.
* works on raw data that is standardized, and works more openly rather than tending towards any proprietary methods.

In this way markup such as font and color is not really a distinguishing factor, because the character sequences that affect font and color are simply standard characters inserted automatically by a ''background text processing'' mode, made to work transparently by compliant text editors, yet becoming otherwise visible as ''text processing commands'' when that mode is not in effect. So text processing is defined most basically (but not entirely) around the visual characters (or graphemes) rather than the standard, yet invisible characters.
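
The point that font and color markup consists of ordinary characters can be made concrete with a small sketch. It assumes the common CSI form of ANSI escape sequences (the escape character, `[`, optional parameter and intermediate bytes, and one final byte); the names `ANSI_CSI` and `strip_ansi` are hypothetical, and other kinds of escape sequence are deliberately not handled.

```python
import re

# Assumption: only CSI sequences are matched, i.e. ESC, "[", parameter
# bytes (0x30-0x3F), intermediate bytes (0x20-0x2F), one final byte (0x40-0x7E).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove CSI escape sequences, leaving only the visible characters."""
    return ANSI_CSI.sub("", text)


# "\x1b[1;31m" selects bold red on many terminals; "\x1b[0m" resets styling.
styled = "\x1b[1;31mWarning\x1b[0m: low battery"
plain = strip_ansi(styled)

print(repr(styled))  # the markup is visible as plain characters
print(plain)         # only the graphemes remain
```

Printing `repr(styled)` corresponds to the case where the background mode is not in effect and the sequences become visible, whereas a compliant terminal would render `styled` as colored text.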


History

The development of computer text processing started in earnest with Kleene's formalization of the ''regular language''. Such ''regular expressions'' could then become a mini-program, complete with a compilation process, available to perform any edit, once that language was extended. Similarly, ''filters'' are extended by evolving particular ''options''.


Basic concepts

An editor essentially invokes an input stream and directs it to the text processing environment, which is either a command shell or a text editor. The resulting output is applicable to further text processing, the final result of which is comparable to a single application of an algorithm applied ''once'' by a more sophisticated and structured computer program. Text processing is, unlike an algorithm, a manually administered sequence of simpler macros that are the pattern-action expressions and filtering mechanisms. In either case the programmer's intention is impressed indirectly upon a given set of textual characters in the act of text processing.

The results of a text processing step are sometimes only tentative, and the attempted mechanism is often subject to multiple drafts through visual feedback, until the regular expression or markup language details, or the utility options, are fully mastered.

Text processing is concerned mostly with producing textual characters at the highest level of computing, where its activities are just below the practical uses of computing: the ''manual'' transmission of information. Ultimately all computing is text processing, from the self-compiling textual characters of an assembler, through the automated programming language generated to handle a blob of graphical data, and finally to the metacharacters of regular expressions which groom existing text documents. Text processing is its own automation.
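
The idea that the output of one step is applicable to further text processing can be sketched as a chain of small filters. This is an illustrative sketch only: the helpers `grep`, `substitute` and `pipeline` are hypothetical names defined here, not functions of any library, and the log lines are invented sample data.

```python
import re
from typing import Callable, Iterable, Iterator

# A filter takes a stream of lines and produces another stream of lines.
Filter = Callable[[Iterable[str]], Iterator[str]]


def grep(pattern: str) -> Filter:
    """Build a filter that passes through only the lines matching the pattern."""
    regex = re.compile(pattern)

    def run(lines: Iterable[str]) -> Iterator[str]:
        return (line for line in lines if regex.search(line))

    return run


def substitute(pattern: str, replacement: str) -> Filter:
    """Build a filter that rewrites every match on every line."""
    regex = re.compile(pattern)

    def run(lines: Iterable[str]) -> Iterator[str]:
        return (regex.sub(replacement, line) for line in lines)

    return run


def pipeline(lines: Iterable[str], *filters: Filter) -> list[str]:
    """Feed the output of each filter into the next, then collect the result."""
    for apply_filter in filters:
        lines = apply_filter(lines)
    return list(lines)


log = [
    "2024-01-05 ERROR disk full",
    "2024-01-05 INFO backup done",
    "2024-01-06 ERROR disk offline",
]

result = pipeline(
    log,
    grep(r"ERROR"),
    # Reorder the date from year-month-day to day/month/year.
    substitute(r"^(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1"),
)
print(result)
```

Each step is a simple pattern-action macro, and in practice each would typically be tried, inspected and revised on its own before the next one is appended, which matches the draft-by-draft workflow described above.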


Characters

Textual characters come in standardized character sets that also contain control characters, such as newline characters, which arrange text. Other types of control characters arrange the transmission, define the character sets, and perform other housekeeping tasks.
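
As a rough illustration of how control characters sit alongside visible ones in a character set, the following sketch classifies each character of a short string. It assumes Unicode's general category `Cc` as the working definition of a control character; the function name `describe_characters` and the label strings are hypothetical.

```python
import unicodedata


def describe_characters(text: str) -> list[tuple[str, int, str]]:
    """Return (escaped form, code point, kind) for each character in the text."""
    rows = []
    for ch in text:
        # Assumption: "control" means Unicode general category Cc.
        kind = "control" if unicodedata.category(ch) == "Cc" else "other"
        rows.append((repr(ch), ord(ch), kind))
    return rows


rows = describe_characters("a\tb\n")
for row in rows:
    print(row)

# Newline control characters arrange text into lines; splitlines() uses them
# as separators and drops them from the result.
lines = "one\ntwo\r\nthree".splitlines()
print(lines)
```

The tab and newline carry no visible glyph of their own, yet they determine how the surrounding characters are laid out, which is the arranging role described above.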


See also

* Text editor
* List of Unix commands


External links


* The subject matter of the book ''Automatic Text Processing'' by Gerard Salton
* Database with Text Processing Tools (2013-10-23)
* Content analysis software: software for content analysis.
* Online text processing tools.